Introduction - clustering analysis of beer review dataset
This notebook performs an in-depth clustering analysis on a dataset containing various beer characteristics, including sensory ratings, chemical properties, and user reviews. The goal is to identify meaningful groupings of beers based on different subsets of features.
Key Steps Covered:
Optimal Parameter Selection
- For KMeans, the Elbow Method and Silhouette Score are used to determine the best number of clusters.
- For Gaussian Mixture Models (GMM), the Bayesian Information Criterion (BIC) is used.
- For DBSCAN, the k-distance graph helps estimate an appropriate epsilon.
Model Comparison
- The following clustering models are tested: KMeans, GaussianMixture, DBSCAN, HDBSCAN, MeanShift, AgglomerativeClustering.
- Each model is evaluated using:
  - Silhouette Score (cohesion within clusters versus separation between them; higher is better)
  - Calinski-Harabasz Score (ratio of between-cluster to within-cluster dispersion; higher is better)
  - Davies-Bouldin Score (average similarity of each cluster to its most similar cluster; lower is better)
Visualization
- PCA is used to project high-dimensional data into 2D space for visual inspection.
- Radar plots visualize average feature values across clusters for:
  - Full feature set
  - Feature subsets: Sensory, Profile, Chemical, Reviews
Export
- Radar plots are saved to projects/proj_3_team_5/plots/ for reporting and further analysis.
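As a quick orientation for the three evaluation metrics listed above, the following is a minimal sketch that computes them for one KMeans fit. It assumes synthetic data from `make_blobs` as a stand-in for the beer features; the names `X_demo`, `labels_demo`, and `scores` are illustrative and do not appear elsewhere in the notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in for the beer features (assumption: not the real dataset).
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

labels_demo = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_demo)

scores = {
    # Bounded in [-1, 1]; higher means better-separated clusters.
    "silhouette": silhouette_score(X_demo, labels_demo),
    # Positive and unbounded; higher is better.
    "calinski_harabasz": calinski_harabasz_score(X_demo, labels_demo),
    # Non-negative; lower is better.
    "davies_bouldin": davies_bouldin_score(X_demo, labels_demo),
}
for metric_name, value in scores.items():
    print(f"{metric_name}: {value:.3f}")
```

Because the three metrics point in different directions (two are maximized, one is minimized), it helps to keep that in mind when reading the comparison tables.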
Imports and data loading
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from dotenv import load_dotenv
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN, MeanShift, HDBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from kneed import KneeLocator
np.random.seed(42)
# Walk up to the repository root so that the relative paths below resolve.
while any(marker in os.getcwd() for marker in ('exercises', 'notebooks', 'students', 'research', 'projects')):
    os.chdir("..")
sys.path.append('.')
load_dotenv('projects/proj_3_team_5/.env')
df_path = os.getenv('PREPROCESSED_DATA_DIR')
df_cleaned_path = os.getenv('CLEANED_DATA_DIR')
df = pd.read_csv(df_path)
df_raw = pd.read_csv(df_cleaned_path)
Optimal parameter selection
# For KMeans, we use the Elbow method (inertia) and the Silhouette Score to select the optimal number of clusters.
# - The Elbow method helps identify the point where adding more clusters does not significantly reduce the within-cluster sum of squares (inertia), indicating a suitable number of clusters.
# - The Silhouette Score measures how similar an object is to its own cluster compared to other clusters, providing a quantitative metric for cluster quality.
# For GaussianMixture, we use the Bayesian Information Criterion (BIC) to select the optimal number of components.
# - BIC penalizes model complexity while rewarding goodness of fit, making it suitable for model selection in probabilistic clustering like GMM.
# - Lower BIC values indicate a better model, balancing fit and complexity.
inertia = []
silhouette = []
bic = []
n_range = range(2, 11)
for n in n_range:
    kmeans = KMeans(n_clusters=n, random_state=42)
    labels = kmeans.fit_predict(df)
    inertia.append(kmeans.inertia_)
    silhouette.append(silhouette_score(df, labels))
    gmm = GaussianMixture(n_components=n, covariance_type='full', random_state=42)
    gmm_labels = gmm.fit_predict(df)
    bic.append(gmm.bic(df))
# Plot KMeans elbow (inertia) and silhouette
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(n_range, inertia, marker='o')
plt.title('KMeans Elbow Method (Inertia)')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.subplot(1, 2, 2)
plt.plot(n_range, silhouette, marker='o')
plt.title('KMeans Silhouette Score')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.tight_layout()
plt.show()
# Plot GMM BIC
plt.figure(figsize=(6, 4))
plt.plot(n_range, bic, marker='o')
plt.title('GaussianMixture BIC')
plt.xlabel('Number of components')
plt.ylabel('BIC')
plt.tight_layout()
plt.show()
# For DBSCAN, we use the k-distance graph to estimate the optimal value of eps (the neighborhood radius).
# - The k-distance plot helps visualize the distance to the k-th nearest neighbor for each point, sorted in ascending order.
# - The "elbow" in this plot suggests a threshold where points start to become outliers, which is a good candidate for eps.
k = 7
neigh = NearestNeighbors(n_neighbors=k)
nbrs = neigh.fit(df)
distances, indices = nbrs.kneighbors(df)
k_distances = np.sort(distances[:, k-1])
plt.figure(figsize=(6, 4))
plt.plot(k_distances)
plt.title('DBSCAN k-distance Graph')
plt.xlabel('Points sorted by distance')
plt.ylabel(f'{k}th Nearest Neighbor Distance')
plt.tight_layout()
plt.show()
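The selection rules described in the comments above (highest silhouette score for KMeans, lowest BIC for GaussianMixture) can also be read off programmatically rather than from the plots. The sketch below is a hedged illustration on synthetic data; `X_demo`, `k_values`, `best_k_silhouette`, and `best_k_bic` are hypothetical names, and the blob centers are chosen only to make the example well separated.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Three well-separated synthetic blobs (assumption: not the beer data).
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

k_values = list(range(2, 7))
sil_scores, bic_scores = [], []
for n in k_values:
    km_labels = KMeans(n_clusters=n, n_init=10, random_state=42).fit_predict(X_demo)
    sil_scores.append(silhouette_score(X_demo, km_labels))
    gmm = GaussianMixture(n_components=n, covariance_type="full", random_state=42).fit(X_demo)
    bic_scores.append(gmm.bic(X_demo))

# Silhouette is maximized, BIC is minimized.
best_k_silhouette = k_values[int(np.argmax(sil_scores))]
best_k_bic = k_values[int(np.argmin(bic_scores))]
print(f"Best k by silhouette: {best_k_silhouette}")
print(f"Best n_components by BIC: {best_k_bic}")
```

The plots remain useful alongside this, since a flat curve with a marginal optimum is easier to spot visually than from a single index.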
Based on the KMeans parameter tuning:
- The Elbow Method shows a visible bend around $k = 8$, suggesting diminishing returns in inertia reduction beyond this point.
- However, the Silhouette Score is highest at $k = 2$ (approximately $0.12$), and drops sharply for higher values of $k$, even becoming negative, which indicates poor cluster separation.
Therefore, considering both methods, the optimal number of clusters is: $k=2$
optimal_kmeans = 2
optimal_gmm = 2
optimal_eps = 1250
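For Agglomerative Clustering, we use the same number of clusters as determined for KMeans. KMeans is a partitioning method and Agglomerative Clustering is hierarchical, but both take the number of clusters as an input, so using the same value makes their results directly comparable.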
For MeanShift and HDBSCAN, we use their default parameter estimation, as these algorithms are designed to infer the number of clusters or density structure from the data.
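Unlike the algorithms that infer their own structure, DBSCAN depends strongly on the chosen `eps`: a radius that is large relative to the scale of the data can merge everything into a single cluster. The following is a small sketch of that effect on synthetic data; the `eps` values, `X_demo`, and the helper `count_clusters` are assumptions made for illustration and are unrelated to the value chosen for the beer dataset.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Three well-separated synthetic blobs (assumption: not the beer data).
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)


def count_clusters(labels):
    """Number of clusters, excluding DBSCAN's noise label (-1)."""
    return len(set(labels)) - (1 if -1 in labels else 0)


cluster_counts = {}
for eps in (1.0, 100.0):
    labels = DBSCAN(eps=eps, min_samples=7).fit_predict(X_demo)
    cluster_counts[eps] = count_clusters(labels)
    print(f"eps={eps}: {cluster_counts[eps]} cluster(s)")
```

With the larger radius every point is a neighbor of every other point, so the whole dataset collapses into one cluster.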
PCA visualization and evaluation of clustering algorithms
models = {
    'KMeans': KMeans(n_clusters=optimal_kmeans, random_state=42),
    'GaussianMixture': GaussianMixture(n_components=optimal_gmm, covariance_type='full', random_state=42),
    'DBSCAN': DBSCAN(eps=optimal_eps, min_samples=k),
    'HDBSCAN': HDBSCAN(),
    'MeanShift': MeanShift(),
    'Agglomerative': AgglomerativeClustering(n_clusters=optimal_kmeans, linkage='complete')
}
pca = PCA(n_components=2)
X_pca = pca.fit_transform(df)
metrics = {}
plt.figure(figsize=(16, 12))
for i, (name, model) in enumerate(models.items(), 1):
    try:
        labels = model.fit_predict(df)
    except Exception as e:
        labels = np.zeros(df.shape[0])
        print(f"Model {name} failed: {e}")
    plt.subplot(3, 2, i)
    plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='tab10', s=10)
    plt.title(f'{name} Clustering')
    plt.xlabel('PCA 1')
    plt.ylabel('PCA 2')
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    if n_clusters < 2:
        sil = -1
        ch = -1
        db = np.inf
    else:
        sil = silhouette_score(df, labels)
        ch = calinski_harabasz_score(df, labels)
        db = davies_bouldin_score(df, labels)
    metrics[name] = {
        'silhouette': sil,
        'calinski_harabasz': ch,
        'davies_bouldin': db,
        'n_clusters': n_clusters
    }
plt.tight_layout()
plt.show()
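A 2D PCA scatter plot is only as informative as the share of variance its two components retain. As a hedged aside, the sketch below shows how that share could be checked with `explained_variance_ratio_`; it uses random synthetic data, and `X_demo`, `pca_demo`, and `retained` are illustrative names rather than variables from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with 6 features (assumption: not the beer data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 6))

pca_demo = PCA(n_components=2).fit(X_demo)

# Fraction of total variance captured by the two plotted components.
retained = pca_demo.explained_variance_ratio_.sum()
print(f"Variance retained by 2 components: {retained:.1%}")
```

If the retained share is low, clusters that overlap in the plot may still be separated in the full feature space, so the plot should be read together with the metrics.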
Selection of best clustering model
sensory_cols = ['review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'review_overall']
profile_cols = ['Alcohol', 'Bitter', 'Sweet', 'Sour', 'Salty', 'Fruits', 'Hoppy', 'Spices', 'Malty', 'Astringency', 'Body']
chemical_cols = ['ABV', 'Min IBU', 'Max IBU']
review_cols = ['number_of_reviews']
metrics_df = pd.DataFrame(metrics).T
display(metrics_df)
# Choose the best clustering based on silhouette score (higher is better)
valid_metrics = metrics_df[metrics_df['n_clusters'] > 1]
if not valid_metrics.empty:
    best_model_name = valid_metrics['silhouette'].idxmax()
    print(f"Best clustering model: {best_model_name}")
else:
    best_model_name = metrics_df['silhouette'].idxmax()
    print(f"Best clustering model (by default): {best_model_name}")
best_model = models[best_model_name]
best_labels = best_model.fit_predict(df)
# Radar plot for the best clustering
radar_cols = sensory_cols + profile_cols + chemical_cols + review_cols
radar_cols = [col for col in radar_cols if col in df.columns]
# Compute mean for each cluster
cluster_means = df[radar_cols].copy()  # copy so that adding 'cluster' does not touch df
cluster_means['cluster'] = best_labels
cluster_means = cluster_means.groupby('cluster').mean()
# Prepare radar plot
categories = radar_cols
N = len(categories)
angles = np.linspace(0, 2 * np.pi, N, endpoint=False).tolist()
angles += angles[:1] # close the loop
plt.figure(figsize=(8, 8))
for idx, (cluster, row) in enumerate(cluster_means.iterrows()):
    values = row.values.flatten().tolist()
    values += values[:1]  # close the loop
    plt.polar(angles, values, label=f'Cluster {cluster}', linewidth=2)
plt.xticks(angles[:-1], categories, color='grey', size=10)
plt.title(f'Radar Plot of Cluster Means for {best_model_name}', size=15, y=1.08)
plt.legend(loc='upper right', bbox_to_anchor=(1.3, 1.1))
plt.tight_layout()
plt.savefig(f"projects/proj_3_team_5/plots/radar_{best_model_name}_overall.png", dpi=300)
plt.show()
# Add cluster labels to the original dataframe
df_with_clusters = df_raw.copy()
df_with_clusters['cluster'] = best_labels
# Show five sample beers from each cluster only if number of clusters is smaller than 10
n_clusters = len(df_with_clusters['cluster'].unique())
if n_clusters < 10:
    print(f"\n=== Sample Beers from Each Cluster ({best_model_name}) ===")
    for cluster in sorted(df_with_clusters['cluster'].unique()):
        cluster_beers = df_with_clusters[df_with_clusters['cluster'] == cluster]
        print(f"\nCluster {cluster} ({len(cluster_beers)} beers):")
        # Sample 5 beers from this cluster
        sample_beers = cluster_beers.sample(n=min(5, len(cluster_beers)), random_state=42)
        display(sample_beers)
else:
    print(f"\nSkipping sample display - too many clusters ({n_clusters})")
| | silhouette | calinski_harabasz | davies_bouldin | n_clusters |
|---|---|---|---|---|
| KMeans | 0.115463 | 193.196781 | 3.617928 | 2.0 |
| GaussianMixture | 0.115463 | 193.196781 | 3.617928 | 2.0 |
| DBSCAN | -1.000000 | -1.000000 | inf | 1.0 |
| HDBSCAN | -0.093229 | 30.772725 | 3.962694 | 2.0 |
| MeanShift | 0.079885 | 2.358320 | 0.640568 | 410.0 |
| Agglomerative | 0.060042 | 87.414205 | 5.002694 | 2.0 |
Best clustering model: KMeans
=== Sample Beers from Each Cluster (KMeans) ===
Cluster 0 (1093 beers):
| | Name | Style | Brewery | Beer Name (Full) | Description | ABV | Min IBU | Max IBU | Astringency | Body | ... | Hoppy | Spices | Malty | review_aroma | review_appearance | review_palate | review_taste | review_overall | number_of_reviews | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1219 | Dogtoberfest | Lager - Märzen / Oktoberfest | Flying Dog Brewery | Flying Dog Brewery Dogtoberfest | There is sauerkraut in my lederhosen. I repeat... | 5.6 | 18 | 25 | 17 | 46 | ... | 69 | 10 | 115 | 3.479839 | 3.685484 | 3.527218 | 3.553427 | 3.639113 | 496 | 0 |
| 284 | Mai-Ur-Bock | Bock - Maibock | Einbecker Brauhaus AG | Einbecker Brauhaus AG Einbecker Mai-Ur-Bock | “Ready for May?” In spring, the Einbecker brew... | 6.5 | 20 | 38 | 14 | 33 | ... | 74 | 10 | 112 | 3.688525 | 3.872951 | 3.827869 | 3.858607 | 3.862705 | 244 | 0 |
| 151 | Pitchfork Rebellious Bitter | Bitter - English | RCH Brewery | RCH Brewery Pitchfork Rebellious Bitter | The name comes from the Pitchfork rebellion of... | 4.3 | 20 | 35 | 34 | 49 | ... | 112 | 12 | 64 | 3.423729 | 3.949153 | 3.627119 | 3.711864 | 3.889831 | 59 | 0 |
| 2026 | Warlock | Stout - American Imperial | Southern Tier Brewing Company | Southern Tier Brewing Company Warlock | Imperial stout brewed with pumpkins Warlock is... | 8.6 | 50 | 80 | 3 | 50 | ... | 9 | 52 | 76 | 3.625000 | 4.125000 | 3.875000 | 4.000000 | 4.000000 | 4 | 0 |
| 1810 | Gaelic Ale | Red Ale - American Amber / Red | Highland Brewing | Highland Brewing Highland Gaelic Ale | A deep amber colored American ale, featuring a... | 5.8 | 25 | 45 | 7 | 23 | ... | 74 | 4 | 57 | 3.665904 | 3.821510 | 3.778032 | 3.869565 | 3.964531 | 437 | 0 |
5 rows × 26 columns
Cluster 1 (1502 beers):
| | Name | Style | Brewery | Beer Name (Full) | Description | ABV | Min IBU | Max IBU | Astringency | Body | ... | Hoppy | Spices | Malty | review_aroma | review_appearance | review_palate | review_taste | review_overall | number_of_reviews | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1893 | Roggenbier | Rye Beer - Roggenbier | Real Ale Brewing Company | Real Ale Brewing Company Roggenbier | NaN | 4.9 | 10 | 20 | 8 | 25 | ... | 15 | 24 | 48 | 3.600000 | 3.900000 | 3.766667 | 3.733333 | 3.866667 | 15 | 1 |
| 1075 | Löwenbräu Original | Lager - Helles | Löwenbräu AG | Löwenbräu AG Löwenbräu Original | NaN | 5.2 | 18 | 25 | 22 | 29 | ... | 60 | 7 | 56 | 3.097594 | 3.304813 | 3.391711 | 3.407754 | 3.616310 | 374 | 1 |
| 783 | Deuchars IPA | IPA - English | The Caledonian Brewing Company | The Caledonian Brewing Company Deuchars IPA | 4.4% ABV in bottles and 3.8% in cask.\t | 4.4 | 35 | 60 | 27 | 53 | ... | 90 | 18 | 59 | 3.782051 | 3.722222 | 3.735043 | 3.888889 | 4.021368 | 117 | 1 |
| 769 | India Pale Ale | IPA - English | Meantime Brewing Company Limited | Meantime Brewing Company Limited India Pale Ale | NaN | 7.5 | 35 | 60 | 16 | 42 | ... | 99 | 11 | 84 | 3.847756 | 4.097756 | 3.956731 | 4.016026 | 4.028846 | 312 | 1 |
| 823 | Sünner Kölsch | Kölsch | Gebr. Sünner GmbH & Co. KG | Gebr. Sünner GmbH & Co. KG Sünner Kölsch | NaN | 4.8 | 18 | 25 | 37 | 32 | ... | 71 | 4 | 69 | 3.607759 | 3.702586 | 3.745690 | 3.732759 | 4.051724 | 116 | 1 |
5 rows × 26 columns
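In the evaluation code above, the noise label (-1) is excluded when counting clusters, but the noise points themselves still enter the score computations as if they formed one more cluster. One possible variant, sketched here on synthetic data, scores only the points that were assigned to a cluster. The data, the `eps` value, and the names `mask`, `sil_all`, and `sil_clean` are assumptions for illustration, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three synthetic blobs plus scattered points that DBSCAN should flag as noise.
X_blobs, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)
rng = np.random.default_rng(0)
X_demo = np.vstack([X_blobs, rng.uniform(-30, 30, size=(20, 2))])

labels = DBSCAN(eps=1.0, min_samples=7).fit_predict(X_demo)

mask = labels != -1                      # True for points assigned to a cluster
n_clusters = len(set(labels[mask]))      # clusters, noise excluded

# Score with noise treated as a cluster versus with noise removed.
sil_all = silhouette_score(X_demo, labels)
sil_clean = silhouette_score(X_demo[mask], labels[mask])
print(f"clusters: {n_clusters}, noise points: {(~mask).sum()}")
print(f"silhouette incl. noise: {sil_all:.3f}, excl. noise: {sil_clean:.3f}")
```

Either convention can be defended; what matters when comparing models is that the same one is applied to all of them and that the share of noise points is reported alongside the score.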
Clustering analysis and visualization by feature groups
feature_groups = {
    'Sensory': sensory_cols,
    'Profile': profile_cols,
    'Chemical': chemical_cols,
    'Reviews': review_cols
}
for group_name, columns in feature_groups.items():
    metrics = {}
    print(f"\n=== Feature Group: {group_name} ===")
    print("Clustering results for each model:")
    X = df[columns].dropna()
    n_components = min(2, X.shape[0], X.shape[1])
    pca = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X)
    plt.figure(figsize=(20, 12))
    plt.suptitle(f'Feature Group: {group_name}', fontsize=16)
    # Find optimal parameters for the models using the elbow method
    # KMeans: Find optimal number of clusters using elbow and silhouette score
    sse = []
    silhouette_scores = []
    k_range = range(2, min(11, X.shape[0]))
    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=42)
        labels = kmeans.fit_predict(X)
        sse.append(kmeans.inertia_)
        # Only compute silhouette if more than 1 cluster
        if len(set(labels)) > 1:
            sil = silhouette_score(X, labels)
        else:
            sil = -1
        silhouette_scores.append(sil)
    # Find elbow point (simple heuristic: where the decrease sharply slows)
    if len(sse) > 2:
        elbow_k = k_range[np.argmin(np.diff(sse, 2)) + 1]
    else:
        elbow_k = k_range[0]
    # Find k with maximum silhouette score
    best_sil_k = k_range[np.argmax(silhouette_scores)]
    # Combine: pick k that is closest to elbow_k but also has high silhouette (within 90% of max)
    sil_threshold = 0.9 * max(silhouette_scores)
    candidate_ks = [k for k, sil in zip(k_range, silhouette_scores) if sil >= sil_threshold]
    if candidate_ks:
        optimal_kmeans = min(candidate_ks, key=lambda k: abs(k - elbow_k))
    else:
        optimal_kmeans = elbow_k
    # GaussianMixture: Use same optimal number of components as KMeans
    optimal_gmm = optimal_kmeans
    # DBSCAN: Find optimal eps using k-distance graph (elbow method)
    neigh = NearestNeighbors(n_neighbors=2)
    nbrs = neigh.fit(X)
    distances, indices = nbrs.kneighbors(X)
    distances = np.sort(distances[:, 1])
    # Heuristic: take the point of maximum curvature as optimal eps
    kneedle = KneeLocator(range(len(distances)), distances, S=1.0, curve="convex", direction="increasing")
    optimal_eps = distances[kneedle.knee] if kneedle.knee is not None else np.percentile(distances, 90)
    k = 2  # min_samples
    # Agglomerative: Use same optimal number of clusters as KMeans
    optimal_agglom = optimal_kmeans
    models = {
        'KMeans': KMeans(n_clusters=optimal_kmeans, random_state=42),
        'GaussianMixture': GaussianMixture(n_components=optimal_gmm, covariance_type='full', random_state=42),
        'DBSCAN': DBSCAN(eps=optimal_eps, min_samples=k),
        'HDBSCAN': HDBSCAN(),
        'MeanShift': MeanShift(),
        'Agglomerative': AgglomerativeClustering(n_clusters=optimal_agglom, linkage='complete')
    }
    for i, (model_name, model) in enumerate(models.items(), 1):
        labels = model.fit_predict(X)
        plt.subplot(2, 3, i)
        if n_components == 1:
            plt.scatter(X_pca[:, 0], [0]*len(X_pca), c=labels, cmap='tab10', s=10)
            plt.xlabel('PCA 1')
            plt.ylabel('0 (no 2nd PCA component)')
        else:
            plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='tab10', s=10)
            plt.xlabel('PCA 1')
            plt.ylabel('PCA 2')
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        if n_clusters < 2:
            sil = -1
            ch = -1
            db = np.inf
        else:
            sil = silhouette_score(X, labels)
            ch = calinski_harabasz_score(X, labels)
            db = davies_bouldin_score(X, labels)
        plt.title(f'{model_name}')
        metrics[model_name] = {
            'silhouette': sil,
            'calinski_harabasz': ch,
            'davies_bouldin': db,
            'n_clusters': n_clusters
        }
    plt.tight_layout()
    plt.show()
    print("Model metrics for this group:")
    metrics_df = pd.DataFrame(metrics).T
    display(metrics_df)
    # Plot radar plot for the best model in this group
    valid_metrics = metrics_df[metrics_df['n_clusters'] > 1]
    if not valid_metrics.empty:
        best_model_name = valid_metrics['silhouette'].idxmax()
    else:
        best_model_name = metrics_df['silhouette'].idxmax()
    print(f"Best clustering model for {group_name}: {best_model_name}")
    best_model = models[best_model_name]
    best_labels = best_model.fit_predict(X)
    # Compute mean for each cluster
    cluster_means = pd.DataFrame(X, columns=columns)
    cluster_means['cluster'] = best_labels
    cluster_means = cluster_means.groupby('cluster').mean()
    # Prepare radar plot
    categories = columns
    N = len(categories)
    angles = np.linspace(0, 2 * np.pi, N, endpoint=False).tolist()
    angles += angles[:1]  # close the loop
    plt.figure(figsize=(8, 8))
    for idx, (cluster, row) in enumerate(cluster_means.iterrows()):
        values = row.values.flatten().tolist()
        values += values[:1]  # close the loop
        plt.polar(angles, values, label=f'Cluster {cluster}', linewidth=2)
    plt.xticks(angles[:-1], categories, color='grey', size=10)
    plt.title(f'Radar Plot of Cluster Means for {best_model_name} ({group_name})', size=15, y=1.08)
    plt.legend(loc='upper right', bbox_to_anchor=(1.3, 1.1))
    plt.tight_layout()
    plt.savefig(f"projects/proj_3_team_5/plots/radar__{best_model_name}_{group_name.lower()}.png", dpi=300)
    plt.show()
    # Add cluster labels to the original dataframe
    df_with_clusters = df_raw.copy()
    df_with_clusters['cluster'] = best_labels
    # Show five sample beers from each cluster only if number of clusters is smaller than 10
    n_clusters = len(df_with_clusters['cluster'].unique())
    if n_clusters < 10:
        print(f"\n=== Sample Beers from Each Cluster ({best_model_name}) ===")
        for cluster in sorted(df_with_clusters['cluster'].unique()):
            cluster_beers = df_with_clusters[df_with_clusters['cluster'] == cluster]
            print(f"\nCluster {cluster} ({len(cluster_beers)} beers):")
            # Sample 5 beers from this cluster
            sample_beers = cluster_beers.sample(n=min(5, len(cluster_beers)), random_state=42)
            display(sample_beers)
    else:
        print(f"\nSkipping sample display - too many clusters ({n_clusters})")
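The elbow heuristic in the loop above works on second differences of the inertia curve. The sketch below is a standalone illustration on made-up inertia values (an assumption, not results from this notebook). It takes the largest second difference as the elbow, which is one common reading of "where the decrease sharply slows"; the loop above uses `np.argmin` instead, so the two may select different values of k.

```python
import numpy as np

k_values = list(range(2, 8))
# Made-up inertia values for k = 2..7 (assumption): a steep drop, then a plateau.
inertia_demo = [1000.0, 300.0, 250.0, 220.0, 200.0, 185.0]

# second_diff[i] = inertia[i+2] - 2 * inertia[i+1] + inertia[i],
# i.e. the change in slope around the point at index i + 1.
second_diff = np.diff(inertia_demo, n=2)

# The largest positive change in slope marks the sharpest bend.
elbow_k = k_values[int(np.argmax(second_diff)) + 1]
print(f"Second differences: {second_diff.tolist()}")
print(f"Elbow at k = {elbow_k}")
```

On real inertia curves the bend is often less pronounced, which is why the notebook combines the elbow with the silhouette score rather than relying on either alone.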
=== Feature Group: Sensory ===
Clustering results for each model:
Model metrics for this group:
| | silhouette | calinski_harabasz | davies_bouldin | n_clusters |
|---|---|---|---|---|
| KMeans | 0.490976 | 3242.767989 | 0.769278 | 2.0 |
| GaussianMixture | 0.205715 | 71.971462 | 4.538508 | 2.0 |
| DBSCAN | 0.242044 | 3.420279 | 5.444213 | 3.0 |
| HDBSCAN | -0.116752 | 60.261845 | 1.972067 | 4.0 |
| MeanShift | 0.237655 | 397.893592 | 1.191438 | 9.0 |
| Agglomerative | 0.512676 | 2762.645432 | 0.702818 | 2.0 |
Best clustering model for Sensory: Agglomerative
=== Sample Beers from Each Cluster (Agglomerative) ===
Cluster 0 (2104 beers):
| | Name | Style | Brewery | Beer Name (Full) | Description | ABV | Min IBU | Max IBU | Astringency | Body | ... | Hoppy | Spices | Malty | review_aroma | review_appearance | review_palate | review_taste | review_overall | number_of_reviews | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1223 | Munsterfest | Lager - Märzen / Oktoberfest | Three Floyds Brewing Co. & Brewpub | Three Floyds Brewing Co. & Brewpub Munsterfest | NaN | 6.0 | 18 | 25 | 18 | 46 | ... | 50 | 11 | 95 | 3.649733 | 3.679144 | 3.745989 | 3.748663 | 3.893048 | 187 | 0 |
| 395 | George | Brown Ale - American | Hill Farmstead Brewery | Hill Farmstead Brewery George | George was our grandfather’s brother, and Hill... | 6.0 | 25 | 45 | 14 | 68 | ... | 67 | 7 | 133 | 4.000000 | 4.083333 | 4.000000 | 3.958333 | 3.916667 | 12 | 0 |
| 2084 | Best Extra Stout | Stout - Foreign / Export | Coopers Brewery Limited | Coopers Brewery Limited Coopers Best Extra Stout | Now here's a beer with punch. Coopers Best Ext... | 6.3 | 30 | 70 | 5 | 65 | ... | 43 | 10 | 98 | 3.753482 | 3.905292 | 3.710306 | 3.892758 | 3.870474 | 359 | 0 |
| 327 | Mountain Holidays In Vermont | Bock - Traditional | Rock Art Brewery | Rock Art Brewery Mountain Holidays In Vermont ... | NaN | 5.8 | 20 | 30 | 12 | 74 | ... | 47 | 26 | 119 | 3.701299 | 3.837662 | 3.805195 | 3.746753 | 3.824675 | 77 | 0 |
| 209 | St. Feuillien Blonde | Blonde Ale - Belgian | Brasserie St. Feuillien | Brasserie St. Feuillien St. Feuillien Blonde | NaN | 7.5 | 15 | 30 | 29 | 38 | ... | 63 | 32 | 35 | 3.829787 | 3.904255 | 3.723404 | 3.904255 | 3.936170 | 47 | 0 |
5 rows × 26 columns
Cluster 1 (491 beers):
| | Name | Style | Brewery | Beer Name (Full) | Description | ABV | Min IBU | Max IBU | Astringency | Body | ... | Hoppy | Spices | Malty | review_aroma | review_appearance | review_palate | review_taste | review_overall | number_of_reviews | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2279 | Faithfull Ale | Strong Ale - Belgian Pale | Dogfish Head Brewery | Dogfish Head Brewery Faithfull Ale | Faithfull Ale is a celebration of Pearl Jam's ... | 7.0 | 20 | 40 | 24 | 48 | ... | 43 | 19 | 47 | 3.289286 | 3.632143 | 3.460714 | 3.278571 | 3.300000 | 140 | 1 |
| 683 | Ta Henket | Herb and Spice Beer | Dogfish Head Brewery | Dogfish Head Brewery Ta Henket | For this ambitious liquid time capsule, we use... | 4.5 | 5 | 40 | 7 | 25 | ... | 55 | 26 | 60 | 3.250000 | 3.490385 | 3.471154 | 3.355769 | 3.365385 | 52 | 1 |
| 1900 | Orkiszowe | Rye Beer - Roggenbier | Browar Kormoran | Browar Kormoran Orkiszowe | NaN | 5.1 | 10 | 20 | 0 | 0 | ... | 0 | 0 | 0 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 1 | 1 |
| 2438 | Benediktiner Weissbier Dunkel | Wheat Beer - Dunkelweizen | Klosterbrauerei Ettal / Ettaler Klosterbetrieb... | Klosterbrauerei Ettal / Ettaler Klosterbetrieb... | NaN | 5.4 | 10 | 15 | 5 | 26 | ... | 11 | 15 | 57 | 3.250000 | 3.750000 | 3.250000 | 3.500000 | 3.500000 | 2 | 1 |
| 1871 | Station 33 Firehouse Red | Red Ale - Irish | North Country Brewing | North Country Brewing Station 33 Firehouse Red | NaN | 5.5 | 20 | 30 | 18 | 47 | ... | 44 | 2 | 114 | 3.147059 | 3.705882 | 3.411765 | 3.441176 | 3.411765 | 17 | 1 |
5 rows × 26 columns
=== Feature Group: Profile ===
Clustering results for each model:
Model metrics for this group:
| | silhouette | calinski_harabasz | davies_bouldin | n_clusters |
|---|---|---|---|---|
| KMeans | 0.207494 | 504.609572 | 1.465888 | 6.0 |
| GaussianMixture | -0.012702 | 175.887219 | 2.307591 | 6.0 |
| DBSCAN | -0.056300 | 13.582412 | 1.617519 | 5.0 |
| HDBSCAN | -0.419559 | 11.526929 | 1.582308 | 21.0 |
| MeanShift | -1.000000 | -1.000000 | inf | 1.0 |
| Agglomerative | 0.132923 | 346.715836 | 1.797764 | 6.0 |
Best clustering model for Profile: KMeans
=== Sample Beers from Each Cluster (KMeans) ===
Cluster 0 (551 beers):
| | Name | Style | Brewery | Beer Name (Full) | Description | ABV | Min IBU | Max IBU | Astringency | Body | ... | Hoppy | Spices | Malty | review_aroma | review_appearance | review_palate | review_taste | review_overall | number_of_reviews | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2525 | Namaste White Belgian-Style Witbier | Wheat Beer - Witbier | Dogfish Head Brewery | Dogfish Head Brewery Namaste | A witbier bursting with good karma. Made with ... | 4.8 | 10 | 20 | 14 | 29 | ... | 32 | 29 | 32 | 3.990291 | 3.987055 | 3.896440 | 3.998382 | 4.085761 | 309 | 0 |
| 594 | Malmgård Jouluolut | Farmhouse Ale - Sahti | Malmgårdin Panimo | Malmgårdin Panimo Malmgård Jouluolut | NaN | 4.5 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 3.000000 | 3.500000 | 3.500000 | 2.500000 | 3.500000 | 1 | 0 |
| 811 | Smetoniška Gira | Kvass | Vofas-Engelman | Vofas-Engelman Smetonika Gira | NaN | 1.2 | 0 | 0 | 2 | 4 | ... | 1 | 1 | 16 | 3.166667 | 3.500000 | 3.666667 | 3.166667 | 3.500000 | 3 | 0 |
| 592 | Finlandia Sahti | Farmhouse Ale - Sahti | Finlandia Sahti Ky | Finlandia Sahti Ky Finlandia Sahti | NaN | 8.0 | 0 | 0 | 7 | 18 | ... | 17 | 6 | 19 | 4.500000 | 3.000000 | 3.000000 | 3.500000 | 3.500000 | 2 | 0 |
| 1119 | Organic Beer Shinshu Sansan | Lager - Japanese Rice | Yo-Ho Brewing Company | Yo-Ho Brewing Company Organic Beer Shinshu Sansan | NaN | 5.0 | 6 | 18 | 7 | 7 | ... | 15 | 1 | 18 | 3.333333 | 3.000000 | 3.000000 | 3.000000 | 3.500000 | 6 | 0 |
5 rows × 26 columns
Cluster 1 (294 beers):
| | Name | Style | Brewery | Beer Name (Full) | Description | ABV | Min IBU | Max IBU | Astringency | Body | ... | Hoppy | Spices | Malty | review_aroma | review_appearance | review_palate | review_taste | review_overall | number_of_reviews | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 559 | Green's Endeavour Dubbel Dark Ale | Dubbel | Green's Gluten Free Beers | Green's Gluten Free Beers Green's Endeavour | NaN | 7.0 | 15 | 30 | 13 | 42 | ... | 27 | 16 | 64 | 2.935185 | 3.583333 | 2.481481 | 2.509259 | 2.509259 | 54 | 1 |
| 208 | Blond | Blonde Ale - Belgian | Brasserie de l'Abbaye du Val-Dieu | Brasserie de l'Abbaye du Val-Dieu Val-Dieu Blond | NaN | 6.0 | 15 | 30 | 33 | 35 | ... | 61 | 33 | 56 | 3.771277 | 3.941489 | 3.840426 | 3.856383 | 3.984043 | 94 | 1 |
| 1355 | Lou Pepe - Gueuze | Lambic - Gueuze | Brasserie Cantillon | Brasserie Cantillon Cantillon Lou Pepe - Gueuze | NaN | 5.0 | 0 | 10 | 23 | 23 | ... | 6 | 2 | 5 | 4.335938 | 4.113281 | 4.230469 | 4.378906 | 4.406250 | 128 | 1 |
| 2555 | Consecration | Wild Ale | Russian River Brewing Company | Russian River Brewing Company Consecration | Dark ale aged in Cabernet Sauvignon barrels wi... | 10.0 | 5 | 30 | 31 | 28 | ... | 9 | 9 | 16 | 4.366114 | 4.091232 | 4.283768 | 4.473934 | 4.296801 | 844 | 1 |
| 2527 | Blanche De Chambly | Wheat Beer - Witbier | Unibroue | Unibroue Blanche De Chambly | The Blanche de Chambly label features the icon... | 5.0 | 10 | 20 | 17 | 33 | ... | 42 | 34 | 52 | 3.805526 | 3.837407 | 3.756111 | 3.840064 | 3.971307 | 941 | 1 |
5 rows × 26 columns
Cluster 2 (576 beers):
| | Name | Style | Brewery | Beer Name (Full) | Description | ABV | Min IBU | Max IBU | Astringency | Body | ... | Hoppy | Spices | Malty | review_aroma | review_appearance | review_palate | review_taste | review_overall | number_of_reviews | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1298 | Rusty Chain | Lager - Vienna | Flying Bison Brewing Company | Flying Bison Brewing Company Rusty Chain | The #1 best selling local craft beer in Buffal... | 5.2 | 15 | 30 | 14 | 45 | ... | 48 | 7 | 80 | 3.250000 | 3.562500 | 3.625000 | 3.625000 | 3.625000 | 8 | 2 |
| 461 | Pullman Nut Brown | Brown Ale - English | Flossmoor Station Restaurant & Brewery | Flossmoor Station Restaurant & Brewery Pullman... | A traditional english brown ale, very nutty ar... | 6.0 | 15 | 25 | 8 | 96 | ... | 28 | 4 | 177 | 3.983553 | 3.917763 | 3.960526 | 4.078947 | 4.046053 | 152 | 2 |
| 1819 | Captain Sig's Northwestern Ale | Red Ale - American Amber / Red | Rogue Ales | Rogue Ales Captain Sig's Northwestern Ale | Label of 22oz bottle:10 Ingredients: Pale 2-ro... | 6.2 | 25 | 45 | 6 | 55 | ... | 107 | 9 | 112 | 3.759939 | 4.027523 | 3.808869 | 3.799694 | 3.844037 | 327 | 2 |
| 2145 | Black Magic Stout | Stout - Irish Dry | Empire Brewing Company | Empire Brewing Company Black Magic Stout | A traditional dry Irish stout, carbonated with... | 4.8 | 30 | 40 | 43 | 100 | ... | 44 | 12 | 114 | 3.625000 | 4.375000 | 3.541667 | 3.625000 | 3.750000 | 12 | 2 |
| 2034 | Obsidian Stout | Stout - American | Deschutes Brewery | Deschutes Brewery Obsidian Stout | Deep, robust and richly rewarding, this is bee... | 6.4 | 35 | 60 | 9 | 72 | ... | 67 | 6 | 132 | 4.077964 | 4.304768 | 4.121134 | 4.266753 | 4.250000 | 776 | 2 |
5 rows × 26 columns
Cluster 3 (564 beers):
| | Name | Style | Brewery | Beer Name (Full) | Description | ABV | Min IBU | Max IBU | Astringency | Body | ... | Hoppy | Spices | Malty | review_aroma | review_appearance | review_palate | review_taste | review_overall | number_of_reviews | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1916 | Rye IPA | Rye Beer | Black Market Brewing Co. | Black Market Brewing Co. Rye IPA | NaN | 7.5 | 10 | 80 | 17 | 31 | ... | 89 | 18 | 75 | 3.650000 | 3.750000 | 3.650000 | 3.700000 | 3.700000 | 10 | 3 |
| 1847 | Lavery Imperial Red Ale | Red Ale - Imperial | Lavery Brewing Company | Lavery Brewing Company Lavery Imperial Red Ale | BIG. HOPPY. RED. Irish beer gone incognito! Ou... | 8.2 | 55 | 85 | 16 | 37 | ... | 76 | 2 | 44 | 3.666667 | 4.166667 | 3.833333 | 3.750000 | 3.833333 | 6 | 3 |
| 874 | Molson Ice | Lager - Adjunct | Molson Coors Canada | Molson Coors Canada Molson Ice | NaN | 5.6 | 8 | 18 | 21 | 18 | ... | 38 | 3 | 55 | 2.309816 | 2.812883 | 2.684049 | 2.625767 | 2.815951 | 163 | 3 |
| 1857 | O'Hara's Irish Red | Red Ale - Irish | Carlow Brewing Company | Carlow Brewing Company O'Hara's Irish Red | NaN | 4.3 | 20 | 30 | 15 | 40 | ... | 68 | 2 | 99 | 3.505952 | 3.830357 | 3.500000 | 3.511905 | 3.684524 | 168 | 3 |
| 1623 | Evil Power | Pilsner - Imperial | Three Floyds Brewing Co. & Brewpub | Three Floyds Brewing Co. & Brewpub Evil Power | A fortified European-style Pilsner lagered to ... | 7.2 | 30 | 65 | 37 | 50 | ... | 100 | 9 | 82 | 3.428571 | 3.803571 | 3.607143 | 3.508929 | 3.464286 | 56 | 3 |
5 rows × 26 columns
Cluster 4 (209 beers):
| | Name | Style | Brewery | Beer Name (Full) | Description | ABV | Min IBU | Max IBU | Astringency | Body | ... | Hoppy | Spices | Malty | review_aroma | review_appearance | review_palate | review_taste | review_overall | number_of_reviews | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 544 | Benediction | Dubbel | Russian River Brewing Company | Russian River Brewing Company Benediction | Brown in color, Benediction has notes aromas a... | 6.75 | 15 | 30 | 11 | 47 | ... | 32 | 27 | 58 | 3.960000 | 3.880000 | 4.120000 | 4.180000 | 4.280000 | 25 | 4 |
| 2455 | Dancing Man | Wheat Beer - Hefeweizen | New Glarus Brewing Company | New Glarus Brewing Company Dancing Man Wheat | If you dream of wheat this brew will get your ... | 7.20 | 10 | 15 | 13 | 46 | ... | 16 | 69 | 59 | 4.249315 | 4.236986 | 4.228767 | 4.335616 | 4.371233 | 365 | 4 |
| 1770 | UFO Pumpkin | Pumpkin Beer | Harpoon Brewery | Harpoon Brewery UFO Pumpkin | Imagine a pumpkin vine wound its way in a fiel... | 5.90 | 5 | 70 | 9 | 37 | ... | 27 | 75 | 59 | 3.703125 | 3.710938 | 3.562500 | 3.531250 | 3.570312 | 64 | 4 |
| 2566 | Samuel Adams Old Fezziwig AleBoston Beer Compa... | Winter Warmer | Boston Beer Company (Samuel Adams) | Boston Beer Company (Samuel Adams) Samuel Adam... | Old Fezziwig, Rich & Sweet: Like the character... | 5.90 | 35 | 50 | 6 | 35 | ... | 10 | 64 | 78 | 3.777236 | 3.906911 | 3.747561 | 3.843089 | 3.810163 | 1230 | 4 |
| 1518 | Hell's Belle | Pale Ale - Belgian | Big Boss Brewing | Big Boss Brewing Hell's Belle | NaN | 7.00 | 20 | 30 | 19 | 36 | ... | 44 | 40 | 46 | 3.522222 | 3.627778 | 3.605556 | 3.538889 | 3.650000 | 90 | 4 |
5 rows × 26 columns
Cluster 5 (401 beers):
| | Name | Style | Brewery | Beer Name (Full) | Description | ABV | Min IBU | Max IBU | Astringency | Body | ... | Hoppy | Spices | Malty | review_aroma | review_appearance | review_palate | review_taste | review_overall | number_of_reviews | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1955 | Railbender Ale | Scottish Ale | Erie Brewing Co. | Erie Brewing Co. Railbender Ale | Erie Brewing Company flagship beer features a ... | 6.8 | 9 | 25 | 5 | 40 | ... | 22 | 6 | 103 | 3.422559 | 3.594276 | 3.602694 | 3.624579 | 3.664983 | 297 | 5 |
| 1944 | Heavy Horse Scotch Ale | Scotch Ale / Wee Heavy | Big Sky Brewing Company | Big Sky Brewing Company Heavy Horse Scotch Ale | NaN | 6.7 | 25 | 35 | 12 | 61 | ... | 37 | 14 | 119 | 3.675000 | 3.891667 | 3.741667 | 3.691667 | 3.716667 | 60 | 5 |
| 71 | Holidale | Barleywine - American | Berkshire Brewing Company Inc. | Berkshire Brewing Company Inc. Holidale | NaN | 9.5 | 60 | 100 | 15 | 72 | ... | 69 | 27 | 107 | 3.831325 | 3.987952 | 3.906627 | 3.990964 | 3.891566 | 166 | 5 |
| 1467 | Thanksgiving Ale | Old Ale | Mayflower Brewing Company | Mayflower Brewing Company Mayflower Thanksgivi... | The first and only perennial offering in our C... | 6.7 | 30 | 65 | 18 | 69 | ... | 51 | 53 | 151 | 3.826389 | 3.798611 | 3.881944 | 3.972222 | 3.972222 | 72 | 5 |
| 303 | Southampton May Bock | Bock - Maibock | Southampton Publick House | Southampton Publick House Southampton May Bock | NaN | 6.5 | 20 | 38 | 23 | 57 | ... | 43 | 11 | 116 | 3.835227 | 3.880682 | 3.977273 | 4.113636 | 4.221591 | 88 | 5 |
5 rows × 26 columns
=== Feature Group: Chemical ===
Clustering results for each model:
Model metrics for this group:
| | silhouette | calinski_harabasz | davies_bouldin | n_clusters |
|---|---|---|---|---|
| KMeans | 0.515126 | 2293.056586 | 0.852369 | 2.0 |
| GaussianMixture | 0.469307 | 1347.788702 | 1.145447 | 2.0 |
| DBSCAN | 0.248944 | 312.481734 | 1.674888 | 15.0 |
| HDBSCAN | 0.429212 | 31.598181 | 1.986562 | 222.0 |
| MeanShift | 0.436874 | 1198.007788 | 0.772829 | 3.0 |
| Agglomerative | 0.442224 | 2029.390041 | 0.983633 | 2.0 |
Best clustering model for Chemical: KMeans
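The reported winner is consistent with choosing the model with the highest silhouette score. The notebook's actual selection code is not shown in this output, so the following is only a minimal sketch under that assumption; `pick_best_model` is a hypothetical helper name, and the scores are copied from the Chemical metrics table above.

```python
# Silhouette scores copied from the Chemical metrics table above.
chemical_silhouette = {
    "KMeans": 0.515126,
    "GaussianMixture": 0.469307,
    "DBSCAN": 0.248944,
    "HDBSCAN": 0.429212,
    "MeanShift": 0.436874,
    "Agglomerative": 0.442224,
}


def pick_best_model(scores):
    """Return the model name with the highest score.

    Hypothetical helper: assumes selection is based on silhouette alone,
    which the notebook output suggests but does not confirm.
    """
    return max(scores, key=scores.get)


best_chemical = pick_best_model(chemical_silhouette)
print(f"Best clustering model for Chemical: {best_chemical}")
```

Applying the same rule to the Reviews metrics table also yields HDBSCAN, which matches the reported choice for that group. A silhouette-only rule ignores the Calinski-Harabasz and Davies-Bouldin columns, so a model with many small clusters can win despite weak values on those two metrics.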
=== Sample Beers from Each Cluster (KMeans) ===
Cluster 0 (2072 beers):
| | Name | Style | Brewery | Beer Name (Full) | Description | ABV | Min IBU | Max IBU | Astringency | Body | ... | Hoppy | Spices | Malty | review_aroma | review_appearance | review_palate | review_taste | review_overall | number_of_reviews | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1558 | English Ale | Pale Ale - English | St. Peter's Brewery Co Ltd | St. Peter's Brewery Co Ltd St. Peter's English... | NaN | 4.5 | 20 | 40 | 38 | 50 | ... | 98 | 3 | 78 | 3.445455 | 3.543182 | 3.577273 | 3.615909 | 3.747727 | 220 | 0 |
| 906 | Cypress Honey Lager | Lager - American Amber / Red | Granville Island Brewery | Granville Island Brewery Cypress Honey Lager | Brewed in small batches, our Cypress Honey Lag... | 4.7 | 18 | 30 | 11 | 34 | ... | 26 | 2 | 132 | 3.000000 | 3.250000 | 3.339286 | 3.125000 | 3.446429 | 28 | 0 |
| 851 | Labatt Blue | Lager - Adjunct | Labatt Brewing Company Ltd. | Labatt Brewing Company Ltd. Labatt Blue | Labatt Blue is the best-selling Canadian beer ... | 5.0 | 8 | 18 | 16 | 22 | ... | 20 | 2 | 32 | 2.406096 | 2.690280 | 2.696870 | 2.672982 | 3.060956 | 607 | 0 |
| 1692 | Bully! Porter | Porter - English | Boulevard Brewing Co. | Boulevard Brewing Co. Bully! Porter | The intense flavors of dark-roasted malt in Bo... | 6.0 | 20 | 30 | 10 | 79 | ... | 36 | 14 | 99 | 3.804393 | 4.125523 | 3.791841 | 3.991632 | 4.001046 | 478 | 0 |
| 482 | Red Bird Ale | California Common / Steam Beer | Portsmouth Brewing Co. / Mault's Brewpub | Portsmouth Brewing Co. / Mault's Brewpub Ports... | An American Home Run! Named for the 1939 Ports... | 4.8 | 35 | 45 | 8 | 22 | ... | 26 | 3 | 30 | 3.333333 | 3.500000 | 3.250000 | 3.083333 | 3.250000 | 6 | 0 |
5 rows × 26 columns
Cluster 1 (523 beers):
| | Name | Style | Brewery | Beer Name (Full) | Description | ABV | Min IBU | Max IBU | Astringency | Body | ... | Hoppy | Spices | Malty | review_aroma | review_appearance | review_palate | review_taste | review_overall | number_of_reviews | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2593 | Winter Shredder | Winter Warmer | Cisco Brewers Inc. | Cisco Brewers Inc. Winter Shredder | NaN | 8.8 | 35 | 50 | 15 | 37 | ... | 45 | 67 | 74 | 4.125000 | 3.875000 | 3.875000 | 3.750000 | 4.000000 | 4 | 1 |
| 2212 | Life and Limb | Strong Ale - American | Sierra Nevada Brewing Co. | Sierra Nevada Brewing Co. Life & Limb | Brewed in collaboration with Dogfish Head Craf... | 10.2 | 40 | 100 | 6 | 52 | ... | 36 | 22 | 112 | 3.817507 | 4.114243 | 3.947329 | 3.996291 | 3.887240 | 674 | 1 |
| 42 | Olde GnarlyWine | Barleywine - American | Lagunitas Brewing Company | Lagunitas Brewing Company Olde GnarlyWine | 2011: 10.6% ABV, 69 IBU\t | 10.9 | 60 | 100 | 14 | 49 | ... | 68 | 17 | 89 | 4.052373 | 4.143208 | 4.082651 | 4.127660 | 4.016367 | 611 | 1 |
| 1450 | Fourth Dementia - Bourbon Barrel-Aged | Old Ale | Kuhnhenn Brewing Company | Kuhnhenn Brewing Company Kuhnhenn Bourbon Barr... | This is our 4th Dementia Olde Ale that has bee... | 13.5 | 30 | 65 | 13 | 61 | ... | 18 | 32 | 124 | 4.555556 | 3.941919 | 4.340909 | 4.638889 | 4.474747 | 198 | 1 |
| 698 | Furious | IPA - American | Surly Brewing Company | Surly Brewing Company Furious | A tempest on the tongue, or a moment of pure h... | 6.7 | 50 | 70 | 11 | 24 | ... | 96 | 3 | 61 | 4.374592 | 4.271650 | 4.200572 | 4.379902 | 4.336601 | 1224 | 1 |
5 rows × 26 columns
=== Feature Group: Reviews ===
Clustering results for each model:
Model metrics for this group:
| | silhouette | calinski_harabasz | davies_bouldin | n_clusters |
|---|---|---|---|---|
| KMeans | 0.683902 | 8787.141046 | 0.507592 | 3.0 |
| GaussianMixture | 0.503322 | 4169.878920 | 0.578554 | 3.0 |
| DBSCAN | 0.633867 | 354.255109 | 1.799954 | 20.0 |
| HDBSCAN | 0.754922 | 212.036369 | 1.638289 | 210.0 |
| MeanShift | 0.708887 | 3923.101971 | 0.412261 | 3.0 |
| Agglomerative | 0.709907 | 4861.545776 | 0.412262 | 3.0 |
Best clustering model for Reviews: HDBSCAN
Skipping sample display - too many clusters (211)
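The Reviews metrics table lists 210 clusters for HDBSCAN, while the skip message reports 211. One possible explanation, which is an assumption because the counting code is not visible in this output, is that one count includes the noise label `-1` that scikit-learn's DBSCAN and HDBSCAN assign to outliers. The sketch below shows how the two counts can differ by one; `count_clusters` and `example_labels` are hypothetical names with made-up labels, not the notebook's data.

```python
def count_clusters(labels, noise_label=-1):
    """Count distinct labels with and without the noise label.

    Hypothetical helper: density-based models in scikit-learn mark
    outliers with -1, which is not a real cluster.
    """
    unique = set(labels)
    n_with_noise = len(unique)
    n_without_noise = len(unique - {noise_label})
    return n_with_noise, n_without_noise


# Made-up labels for illustration: three clusters plus two noise points.
example_labels = [0, 0, 1, 2, -1, 1, -1]
with_noise, without_noise = count_clusters(example_labels)
print(f"labels incl. noise: {with_noise}, real clusters: {without_noise}")
```

If this is the cause, using the noise-excluded count in both the metrics table and the display threshold would keep the two numbers consistent.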